University of Potsdam
11/17/22
Selecting elements of a data structure.
By providing a number within square brackets, the respective element is selected from a vector:
When you provide a vector of numbers, multiple elements are selected:
You can even change the order or repeat elements:
With negative numbers, elements are dropped:
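The four cases above could look like the following minimal sketch. The example vector is an assumption made for illustration, borrowed from the exercise in this section.

```r
# Example vector (assumed for illustration)
names <- c("Sheldon", "Leonard", "Penny", "Amy")

names[1]           # a single element: "Sheldon"
names[c(1, 3)]     # several elements: "Sheldon" "Penny"
names[c(4, 1, 1)]  # reordered and repeated: "Amy" "Sheldon" "Sheldon"
names[-2]          # all elements except the second: "Sheldon" "Penny" "Amy"
```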
Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"
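One possible solution, sketched with an index vector that repeats and reorders positions (other index vectors may work as well):

```r
names <- c("Sheldon", "Leonard", "Penny", "Amy")

# Positions 1 and 4 twice, then positions 2 and 3
reordered <- names[c(1, 4, 1, 4, 2, 3)]
reordered
```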
Firstly, we create an example data frame:
Square brackets select a column of a data frame either by a number or by the column name:
The subsetted object is a data frame with one column.
This is different from extracting a variable with $ or [[ signs:
which returns a vector (!)
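The difference between the two kinds of brackets can be sketched as follows. The data frame `study` and all its values are made up for illustration; only the variable names `gender` and `age` appear in this section, and `score` is a hypothetical extra column.

```r
# Hypothetical example data (values are assumptions)
study <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38),
  score  = c(98, 110, 105, 93, 120)
)

col_df  <- study["age"]    # single brackets: data frame with one column
col_vec <- study[["age"]]  # double brackets: numeric vector (same as study$age)

class(col_df)   # "data.frame"
class(col_vec)  # "numeric"
```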
While this works:
this throws an error:
Error in median.default(study["age"]) : need numeric data
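A small sketch of why this matters, again with a made-up `study` data frame: `median()` accepts the vector returned by `[[ ]]` but rejects the one-column data frame returned by `[ ]`. `tryCatch()` is used here only so that the script keeps running after the error.

```r
# Hypothetical example data (values are assumptions)
study <- data.frame(gender = c("m", "f", "f"), age = c(23, 31, 27))

ok <- median(study[["age"]])  # works on a numeric vector: 27

# The same call on a one-column data frame raises an error;
# the error message is captured instead of stopping the script
msg <- tryCatch(
  median(study["age"]),
  error = function(e) conditionMessage(e)
)
msg
```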
Providing a vector will select multiple columns:
The extraction of a vector and the selection of elements can be combined:
Or within one step:
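The three statements above might be written like this. The `study` data frame and its values are assumptions for illustration.

```r
# Hypothetical example data (values are assumptions)
study <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38),
  score  = c(98, 110, 105, 93, 120)
)

two_cols <- study[c("gender", "age")]  # data frame with two columns

ages   <- study[["age"]]  # step 1: extract the vector
second <- ages[2]         # step 2: select an element (31)

one_step <- study[["age"]][2]  # both steps combined (31)
```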
Specific cases are selected within square brackets: object_name[rows, columns].
You could also use numbers to address the columns:
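As a sketch of the `object_name[rows, columns]` pattern, assuming a made-up `study` data frame:

```r
# Hypothetical example data (values are assumptions)
study <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38),
  score  = c(98, 110, 105, 93, 120)
)

by_name   <- study[c(2, 4), c("gender", "age")]  # rows 2 and 4, columns by name
by_number <- study[c(2, 4), c(1, 2)]             # same selection, columns by number
all_cols  <- study[2, ]                          # empty column index: all columns
```

Leaving the row or column position empty selects all rows or all columns, respectively.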
Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.
Subsetting becomes most powerful when it is combined with conditional selections.
For example:
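A conditional selection could look like the sketch below. The data frame and the threshold of 30 are assumptions chosen for illustration.

```r
# Hypothetical example data (values are assumptions)
study <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  age    = c(23, 31, 27, 45, 38)
)

# Keep only the cases older than 30
older <- study[study[["age"]] > 30, ]
older
```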
To apply such selections, we have to know about relational and logical operators.
Relational operators compare two values and return a logical value (TRUE or FALSE).
| Operator | Relation | Example |
|---|---|---|
| == | is identical | x == y |
| != | is not identical | x != y |
| > | is greater | x > y |
| >= | is greater or identical | x >= y |
| < | is less | x < y |
| <= | is less or identical | x <= y |
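For non-numerical objects, == and != are usually the meaningful comparisons (R also accepts <, >, <=, and >= for character strings, but then compares them by sort order):
This behavior is called recycling and is implemented in many (but not all!) R functions.
Recycling: the shorter operand (here a single value) is reused for every element of the longer vector, so the operation is applied to each element and a vector is returned.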
| age | age < 5 |
|---|---|
| 12 | FALSE |
| 4 | TRUE |
| 3 | TRUE |
| 8 | FALSE |
| 4 | TRUE |
| 2 | TRUE |
| 1 | TRUE |
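The table above corresponds to the following sketch, with the `age` values taken from the table:

```r
age <- c(12, 4, 3, 8, 4, 2, 1)

# The single value 5 is compared with every element of age
younger <- age < 5
younger
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
```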
When you put a logical vector within square brackets [ ] after an object, all elements of that object with a TRUE in the logical vector are selected:
| age | x <- age > 5 | Select? | Result |
|---|---|---|---|
| 12 | TRUE | select | 12 |
| 4 | FALSE | drop | |
| 3 | FALSE | drop | |
| 8 | TRUE | select | 8 |
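In code, the selection shown in the table could be written as follows (values taken from the table):

```r
age <- c(12, 4, 3, 8)

x <- age > 5          # TRUE FALSE FALSE TRUE
selected <- age[x]    # keeps the elements where x is TRUE: 12 8

# The same selection in one step
selected_direct <- age[age > 5]
```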
Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.
The which() function gives the indices of the elements that are TRUE.
It takes a logical vector as an argument.
which() can handle missing values:
| Index | age | x <- age < 5 | which(x) | age[which(x)] |
|---|---|---|---|---|
| 1 | 12 | FALSE | | |
| 2 | 4 | TRUE | 2 | 4 |
| 3 | 3 | TRUE | 3 | 3 |
| 4 | 8 | FALSE | | |
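A sketch of how `which()` behaves when a value is missing. The `NA` is added to the example vector as an assumption, to show the difference from plain logical subsetting.

```r
age <- c(12, 4, NA, 3, 8)  # NA inserted for illustration

x <- age < 5               # FALSE TRUE NA TRUE FALSE

idx          <- which(x)       # indices of TRUE only; NA is ignored: 2 4
with_which   <- age[idx]       # 4 3
with_logical <- age[x]         # 4 NA 3 (the NA is carried along)
```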
Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.
Logical vectors can also be applied to data frames for selecting cases.
Let us take an example data frame:
Select with bracket subsetting or the which() function:
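Both approaches might look like the sketch below. The data frame `df`, its values, and the coding of `sen` as 0/1 are assumptions; only the variable names `sen` and `IQ` are taken from the exercise.

```r
# Hypothetical example data (values and 0/1 coding are assumptions)
df <- data.frame(
  sen = c(0, 1, 0, 1, 0),
  IQ  = c(105, 95, 110, 88, 100)
)

# Bracket subsetting with a logical vector in the row position
with_brackets <- df[df[["sen"]] == 1, ]

# The same selection with which(), which passes row indices instead
with_which <- df[which(df[["sen"]] == 1), ]
```

The two results differ only when the condition contains missing values, because `which()` drops them.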
Calculate the mean of IQ for students with and without sen.
Logical operations are applied to logical values.
| Operator | Operation | Example | Results |
|---|---|---|---|
| ! | Not | ! x | TRUE when x = FALSE and FALSE when x = TRUE |
| & | AND | x & y | TRUE when x and y are TRUE else FALSE |
| \| | OR | x \| y | TRUE when x or y is TRUE else FALSE |
Note: To get the | sign:
On a German Mac keyboard press: option + 7
On a German Windows keyboard press: AltGr + <
When applied to vectors, logical operations result in a new vector.
Operations are applied to each element one by one.
Create two vectors:
Determine for each element whether glasses and hyperintelligent are TRUE at the same time.
| glasses | hyperintelligent | glasses & hyperintelligent |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE |
| FALSE | FALSE | FALSE |
| TRUE | TRUE | TRUE |
| FALSE | FALSE | FALSE |
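The table can be reproduced with the sketch below; the two vectors are read off the first two columns of the table.

```r
glasses          <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Element-wise AND: TRUE only where both vectors are TRUE
both <- glasses & hyperintelligent
both
# [1]  TRUE FALSE FALSE  TRUE FALSE
```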
sum() and mean() with logical vectors
When a logical vector is passed to a numeric function (e.g. mean() or sum()), TRUE is counted as 1 and FALSE as 0:
sum() then gives the number of elements that are TRUE.
mean() gives the proportion of elements that are TRUE.
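A minimal sketch with an arbitrary logical vector (the values are assumptions for illustration):

```r
x <- c(TRUE, FALSE, TRUE, TRUE)

n_true    <- sum(x)   # number of TRUE elements: 3
prop_true <- mean(x)  # proportion of TRUE elements: 0.75
```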
Take the data from the last example and calculate the sum and proportion of cases that wear glasses and are hyperintelligent.
Create a vector
income <- c(5000, 4000, 3000, 2000, 1000) and a vector
happiness <- c(20, 35, 30, 10, 50).
Use relational and logical operations to determine for each element whether the income is larger than 2500 and at the same time happiness is above 25.
Calculate the proportion.
| income | happiness | income > 2500 | happiness > 25 | income > 2500 & happiness > 25 |
|---|---|---|---|---|
| 5000 | 20 | TRUE | FALSE | FALSE |
| 4000 | 35 | TRUE | TRUE | TRUE |
| 3000 | 30 | TRUE | TRUE | TRUE |
| 2000 | 10 | FALSE | FALSE | FALSE |
| 1000 | 50 | FALSE | TRUE | FALSE |
… and the proportion
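The proportion could be computed along these lines, using the vectors from the task:

```r
income    <- c(5000, 4000, 3000, 2000, 1000)
happiness <- c(20, 35, 30, 10, 50)

# TRUE where both conditions hold (last column of the table above)
both <- income > 2500 & happiness > 25

prop <- mean(both)  # 2 of 5 cases: 0.4
prop
```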
Use the ChickWeight data frame for the following task.
The data set is already included in R.
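1. Get information on the data set with ?ChickWeight.
2. Show the variable names with the names() function (names(ChickWeight)).
3. Create a new data frame with the cases where Diet == 1 and Time < 16.
4. Calculate the correlation between weight and Time. Note: Use the cor() function (e.g., cor(x, y))
5. Do the same for Diet == 4.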
filter <- ChickWeight[["Diet"]] == 1 & ChickWeight[["Time"]] < 16
diet1 <- ChickWeight[filter, ]
cor(diet1[["weight"]], diet1[["Time"]])
[1] 0.8109772
filter <- ChickWeight[["Diet"]] == 4 & ChickWeight[["Time"]] < 16
diet4 <- ChickWeight[filter, ]
cor(diet4[["weight"]], diet4[["Time"]])
[1] 0.9720822
The correlation is larger for Diet 4. This suggests that the chickens' weight is more closely related to time under Diet 4 than under Diet 1.
subset() function
R comes with a function to make subsetting a bit more straightforward.
subset() has the main arguments:
x : A data.frame
subset : A logical vector for filtering rows
select : expression, indicating columns to select from a data frame
and returns a data.frame.
Variable names must be provided without quotes and without the name of the data.frame.
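As a sketch, the filter from the ChickWeight task could be rewritten with `subset()`. The choice of this example is an assumption; any data frame works the same way.

```r
# Rows: Diet 1 and Time below 16; columns: weight and Time.
# Variable names are given unquoted and without the ChickWeight prefix.
diet1 <- subset(
  ChickWeight,
  subset = Diet == 1 & Time < 16,
  select = c(weight, Time)
)

head(diet1)
```

Compared with bracket subsetting, the data frame name appears only once, which keeps longer conditions readable.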
Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset function.
Subset a data frame (and get a new data frame)
Extract a variable from a data frame (and get a numeric or character vector)
For base R data frames this creates a vector:
This should have resulted in a data frame with one variable but is automatically reduced to a vector.
Add drop = FALSE to keep the result as a data frame:
| | mpg |
|---|---|
| Mazda RX4 | 21.0 |
| Mazda RX4 Wag | 21.0 |
| Ferrari Dino | 19.7 |
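The difference can be sketched with mtcars. The row filter (6 cylinders, am equal to 1) is an assumption inferred from the three car models in the table above.

```r
# Logical vector for the rows (assumed filter)
sel <- mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1

as_vector <- mtcars[sel, "mpg"]                # reduced to a numeric vector
as_df     <- mtcars[sel, "mpg", drop = FALSE]  # stays a data frame with one column

as_vector
as_df
```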
Some modern implementations of data frames (like tibbles) changed this behavior.
Jürgen Wilbert - Introduction to R